Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve carryover dependency handling #623

Merged
merged 1 commit into from
Oct 16, 2024
Merged

Conversation

forsyth2
Copy link
Collaborator

@forsyth2 forsyth2 commented Sep 23, 2024

Issue resolution

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

1. Does this do what we want it to do?

Objectives:

  • If a task should wait for a previous run of itself, that "carryover dependency" should remain respected.
  • Otherwise, a task should re-establish its own dependencies, without including dependencies a previous instantiation might have required.
  • Dependencies should be listed in the .settings file to ease debugging (i.e., it will be obvious to tell what zppy thought the dependencies were)

Required:

  • Product Management: I have confirmed with the stakeholders that the objectives above are correct and complete.
  • Testing: I have added or modified at least one "min-case" configuration file to test this change. Every objective above is represented in at least one cfg.
  • Testing: I have considered likely and/or severe edge cases and have included them in testing.

2. Are the implementation details accurate & efficient?

Required:

  • Logic: I have visually inspected the entire pull request myself.
  • Logic: I have left GitHub comments highlighting important pieces of code logic. I have had these code blocks reviewed by at least one other team member.
  • Left comments, no block here requires review.

3. Is this well documented?

Required:

  • Documentation: by looking at the docs, a new user could easily understand the functionality introduced by this pull request.
    • This is largely a behind-the-scenes change, enabling zppy to run faster by parallelizing more.

4. Is this code clean?

Required:

  • Readability: The code is as simple as possible and well-commented, such that a new team member could understand what's happening.
  • Pre-commit checks: All the pre-commits checks have passed.

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Sep 24, 2024

Test output:

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_carryover_dependencies_output/i622-test-carryover/v3.LR.historical_0051/post/scripts

$ grep -v "OK" *status
# No output, good

$ grep -A 5 "'dependencies'" *settings

That gives the following dependencies:

e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1987-1988.settings:

  • climo_atm_monthly_180x360_aave_1987-1988.status
  • tc_analysis_1987-1988.status

e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1987-1988_vs_1985-1986.settings:

  • climo_atm_monthly_180x360_aave_1987-1988.status [BUG: carried over]
  • tc_analysis_1987-1988.status [BUG: carried over]
  • e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1987-1988.status [BUG; it looks like this is marked as a carryover dependency because e3sm_diags.py is expecting another model_vs_obs after that uses tc_analysis, e.g., 1989-1990, but the mvm run here instead gets stuck with the dependency.]
  • climo_atm_monthly_180x360_aave_1987-1988.status

global_time_series_1985-1995.settings:

  • ts_atm_monthly_glb_1985-1989-0005.status
  • ts_atm_monthly_glb_1990-1994-0005.status
  • ts_lnd_monthly_glb_1985-1989-0005.status
  • ts_lnd_monthly_glb_1990-1994-0005.status
  • mpas_analysis_ts_1985-1989_climo_1985-1989.status
  • mpas_analysis_ts_1985-1995_climo_1990-1995.status

ilamb_1985-1988.settings:

  • ts_land_monthly_1985-1986-0002.status
  • ts_land_monthly_1987-1988-0002.status
  • ts_atm_monthly_180x360_aave_1985-1986-0002.status
  • ts_atm_monthly_180x360_aave_1987-1988-0002.status

mpas_analysis_ts_1985-1989_climo_1985-1989.settings:

  • N/A

mpas_analysis_ts_1985-1995_climo_1990-1995.settings:

  • mpas_analysis_ts_1985-1989_climo_1985-1989.status [GOOD, dependency carried over]

tc_analysis_1985-1986.settings:

  • N/A

tc_analysis_1987-1988.settings:

  • tc_analysis_1985-1986.status [GOOD, dependency carried over]

Copy link
Collaborator Author

@forsyth2 forsyth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chengzhuzhang I've made some comments to better explain this PR. Thanks.

@@ -33,10 +33,12 @@ def e3sm_diags(config, scriptDir, existing_bundles, job_ids_file): # noqa: C901
return existing_bundles

# --- Generate and submit e3sm_diags scripts ---
dependencies: List[str] = []
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is ultimately why you were running into your second problem on #622. The dependencies list is defined before cycling through the tasks. That is, dependencies accumulate rather than start fresh with each task.

That was actually somewhat intentional, because some tasks (notably MPAS-Analysis) relies on previous runs of itself.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@forsyth2 Thank for looking into this. To start fresh dependencies list for each e3sm_diags sub tasks, we can just reset dependencies = [], when looping over tasks, i.e. under line 38 in e3sm_diags.py without touching other part of zppy. I'm not sure why it is needed to introduce another parameter called carried_over_dependencies?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chengzhuzhang we can't start a completely fresh dependencies list on each new task because there is a case where we do want to "remember" dependencies from the previous one. This happens when tc_analysis is a dependency.

In https://github.com/E3SM-Project/zppy/blob/main/zppy/e3sm_diags.py:

                    # Due to a `socket.gaierror: [Errno -2] Name or service not known` error when running e3sm_diags with tc_analysis
                    # on multiple year_sets, if tc_analysis is in sets, then e3sm_diags should be run sequentially.
                    if "tc_analysis" in c["sets"]:
                        # Note that this line should still be executed even if jobid == -1
                        # The later tc_analysis-using e3sm_diags tasks still depend on this task (and thus will also fail).
                        # Add to the dependency list
                        dependencies.append(statusFile)

A socket error occurs if we try to run e3sm_diags in parallel on tc_analysis. To work around that, we simply make e3sm_diags run sequentially when it's working with tc_analysis. (dependencies.append(statusFile) adds the current task to the dependencies for the next task -- this is the point of carried_over_dependencies). This is similar to how we make mpas_analysis run sequentially.

The unfortunate side effect is that e3sm_diags tasks that do not deal with tc_analysis also end up caught in the sequential dependency chaining. And that means they don't run if an earlier task fails.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious how tc_analysis dependency works differently than other dependencies such as climo and time-series. I'm interested in reproducing the socket error. To try to run e3sm_diags in parallel on tc_analysis, does it mean, when tc_analysis task is on, while tc_analysis set is not set to be one of e3sm_diags sets to run, so that tc_analysis and e3sm-diags run in parallel?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious how tc_analysis dependency works differently than other dependencies such as climo and time-series.

In this pull request, it's this line in the task we're finishing up

carried_over_dependencies.append(statusFile)

and this line in the next task:

dependencies: List[str] = carried_over_dependencies

This resets all other dependencies, except for the one we specifically added to carried_over_dependencies (i.e., the last task).

I'm interested in reproducing the socket error

This error occurs intermittently when running two e3sm_diags jobs, both of which run the tc_analysis set. The sequential workaround was introduced in #169. A full discussion of the tc_analysis concurrency/parallelism bugs can be found in #180.

@@ -267,7 +271,7 @@ def e3sm_diags(config, scriptDir, existing_bundles, job_ids_file): # noqa: C901
# Note that this line should still be executed even if jobid == -1
# The later tc_analysis-using e3sm_diags tasks still depend on this task (and thus will also fail).
# Add to the dependency list
dependencies.append(statusFile)
carried_over_dependencies.append(statusFile)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, instead of passing all dependencies on to the next task, only specific tasks (i.e., the current task) will be propagated forward as dependencies.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the bug:

While carried_over_dependencies works for mpas_analysis and tc_analysis, there still seems to be some trouble for e3sm_diags.

From an earlier comment on this PR:

e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1987-1988_vs_1985-1986.settings:

  • climo_atm_monthly_180x360_aave_1987-1988.status [BUG: carried over]
  • tc_analysis_1987-1988.status [BUG: carried over]
  • e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1987-1988.status [BUG; it looks like this is marked as a carryover dependency because e3sm_diags.py is expecting another model_vs_obs after that uses tc_analysis, e.g., 1989-1990, but the mvm run here instead gets stuck with the dependency.]
  • climo_atm_monthly_180x360_aave_1987-1988.status

I'm not entirely sure why the first two are showing up as dependencies when they're not carried over. I need to investigate the logic to see if zppy is picking them up as dependencies for e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1987-1988.

As for e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1987-1988 itself, as mentioned in the quote block above, the issue is that the presence of tc_analysis is telling zppy that we should run diags sequentially. But we don't actually need to do that because the next task isn't doing any work with tc_analysis. Unfortunately, there's no way to know a priori if the next task will be doing that or not. (We'd need to do a preemptive search of later tasks).

@@ -133,7 +139,7 @@ def mpas_analysis(config, scriptDir, existing_bundles, job_ids_file):
# Note that this line should still be executed even if jobid == -1
# The later MPAS-Analysis tasks still depend on this task (and thus will also fail).
# Add to the dependency list
dependencies.append(statusFile)
carried_over_dependencies.append(statusFile)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The carried_over_dependencies switch works for MPAS-Analysis.

@@ -88,7 +93,7 @@ def tc_analysis(config, scriptDir, existing_bundles, job_ids_file):
# Note that this line should still be executed even if jobid == -1
# The later tc_analysis tasks still depend on this task (and thus will also fail).
# Add to the dependency list
dependencies.append(statusFile)
carried_over_dependencies.append(statusFile)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The carried_over_dependencies switch works for tc_analysis.

@@ -99,7 +103,7 @@ def e3sm_diags(config: ConfigObj, script_dir: str, existing_bundles, job_ids_fil
# Note that this line should still be executed even if jobid == -1
# The later tc_analysis-using e3sm_diags tasks still depend on this task (and thus will also fail).
# Add to the dependency list
dependencies.append(status_file)
carried_over_dependencies.append(status_file)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rather than carried_over_dependencies we have a variable previous_status_file and then if "tc_analysis" in c["sets"] we add that to the dependencies as well. Then, diags not using tc_analysis wouldn't be affected.

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Oct 15, 2024

Rebased after the modularizing refactor of #628. Merge conflicts affected:

	both modified:   zppy/e3sm_diags.py
	both modified:   zppy/global_time_series.py
	both modified:   zppy/ilamb.py
	both modified:   zppy/mpas_analysis.py
	both modified:   zppy/tc_analysis.py

and have now been resolved. It looks like the now 22 unit tests are passing.

I had to commit with --no-verify because of:

$ pre-commit run --all-files
[INFO] Initializing environment for https://github.com/pre-commit/pre-commit-hooks.
An error has occurred: InvalidManifestError: 
==> File /gpfs/fs1/home/ac.forsyth2/.cache/pre-commit/repo0641jows/.pre-commit-hooks.yaml
==> At Hook(id='check-added-large-files')
==> At key: stages
==> At index 0
=====> Expected one of commit, commit-msg, manual, merge-commit, post-checkout, post-commit, post-merge, post-rewrite, prepare-commit-msg, push but got: 'pre-commit'
Check the log at /gpfs/fs1/home/ac.forsyth2/.cache/pre-commit/pre-commit.log

which I assume is related to #627.

@forsyth2 forsyth2 mentioned this pull request Oct 15, 2024
7 tasks
@xylar
Copy link
Contributor

xylar commented Oct 15, 2024

@forsyth2, as I said in #627, I don't think this is an issue caused by that update alone. I think a fresh install of pre-commit should take care of it since it worked fine for me on my laptop.

@forsyth2
Copy link
Collaborator Author

Yep, fresh-install worked. Thanks @xylar

@forsyth2 forsyth2 mentioned this pull request Oct 15, 2024
12 tasks
@forsyth2 forsyth2 force-pushed the issue-622-tc-analysis branch 3 times, most recently from c5da603 to 84f8e2c Compare October 16, 2024 00:45
for c in tasks:

dependencies: List[str] = []
Copy link
Collaborator Author

@forsyth2 forsyth2 Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All tasks that weren't changed in this PR already have dependencies reset inside the for c in tasks loop

for c in tasks:

dependencies: List[str] = carried_over_dependencies
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mpas_analysis doesn't have any other dependencies besides previous runs of itself, so we could just move this definition out of the for c in tasks loop, but I think it's more comprehensible to explicitly note the dependency on previous runs is being carried over, via the name carried_over_dependencies

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Oct 16, 2024

The unit tests pass and I ran zppy -c tests/integration/generated/test_min_case_carryover_dependencies_chrysalis.cfg, checking the output directory for failed status files and checking the generated plots. Since tc_analysis currently doesn't work with v3 data and the tc_analysis dependency handling was tested in #631, I just took tc_analysis out of this min-case test.

@forsyth2
Copy link
Collaborator Author

For reference, the dependency output from that min-case test:

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_min_case_carryover_dependencies_output/test-623-20241015-v4/v3.LR.historical_0051/post/scripts
$ grep "'dependencies'" *settings

We can then find the following:

e3sm_diags_atm_monthly_180x360_aave_model_vs_obs_1987-1988.settings:  'dependencies': [ '.../climo_atm_monthly_180x360_aave_1987-1988.status'],

e3sm_diags_atm_monthly_180x360_aave_mvm_model_vs_model_1987-1988_vs_1985-1986.settings:  'dependencies': [ '.../climo_atm_monthly_180x360_aave_1987-1988.status'],

global_time_series_1985-1995.settings: 'dependencies': [ '.../ts_atm_monthly_glb_1985-1989-0005.status',
                    '.../ts_atm_monthly_glb_1990-1994-0005.status',
                    '.../ts_lnd_monthly_glb_1985-1989-0005.status',
                    '.../ts_lnd_monthly_glb_1990-1994-0005.status',
                    '.../mpas_analysis_ts_1985-1989_climo_1985-1989.status',
                    '.../mpas_analysis_ts_1985-1995_climo_1990-1995.status'],

ilamb_1985-1988.settings: 'dependencies': [ '.../ts_land_monthly_1985-1986-0002.status',
                    '.../ts_land_monthly_1987-1988-0002.status',
                    '.../ts_atm_monthly_180x360_aave_1985-1986-0002.status',
                    '.../ts_atm_monthly_180x360_aave_1987-1988-0002.status'],

mpas_analysis_ts_1985-1989_climo_1985-1989.settings:  'dependencies': [],

mpas_analysis_ts_1985-1995_climo_1990-1995.settings:  'dependencies': [ '.../mpas_analysis_ts_1985-1989_climo_1985-1989.status'],

@forsyth2 forsyth2 merged commit f759fdc into main Oct 16, 2024
4 checks passed
@forsyth2 forsyth2 deleted the issue-622-tc-analysis branch October 16, 2024 01:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Issues with tc_analysis in zppy
3 participants